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The estimation of commuting flows at different spatial scales is a fundamental problem for different areas of 
study. Many current methods rely on parameters requiring calibration from empirical trip volumes. Their 
values are often not generalizable to cases without calibration data. To solve this problem we develop a 
statistical expression to calculate commuting trips with a quantitative functional form to estimate the model 
parameter when empirical trip data is not available. We calculate commuting trip volumes at scales from 
within a city to an entire country, introducing a scaling parameter a to the recently proposed parameter free 
radiation model. The model requires only widely available population and facility density distributions. The 
parameter can be interpreted as the influence of the region scale and the degree of heterogeneity in the 
facility distribution. We explore in detail the scaling limitations of this problem, namely under which 
conditions the proposed model can be applied without trip data for calibration. On the other hand, when 
empirical trip data is available, we show that the proposed model's estimation accuracy is as good as other 
existing models. We validated the model in different regions in the U.S., then successfully applied it in three 
different countries. 



Good estimates of how many people frequently travel between two places is related to several areas of study. 
It is not only a key ingredient in modeling the spreading of infectious diseases 1 " 4 , but also a fundamental 
problem in geo-spatial economics of facility distribution and transportation planning. Clues on the 
statistics of human movements and their changes at different scales are thus fundamental. Here we explore 
statistical expressions to calculate the number of commuting trips within areas of sizes ranging from a few 
kilometers (one city or town) to over one thousand kilometers (one country). We focus on commuting trips 
because they are stable in time and account for the largest fraction of the total flows in a population. 

In order to describe the commuting flow patterns of people, various aggregated trip distribution models have 
been proposed 5 " 10 . Among all these efforts, the gravity model, which assumes that the number of movements 
between an origin- destination pair decays with their distance, is the most widely used one 61112 . The gravity model 
can be further divided into the unconstrained gravity model 5 , the singly constrained gravity model (such as the 
B.P.R model 13,14 , or the Voorhees gravity model 11 ) and the doubly constrained gravity model 15 . More constraints 
generally leads to better model performance. The doubly constrained gravity model fits parameters in an iterative 
way to best reconstruct the empirical OD matrix. But the iterative fitting complicates the calibration of the 
distance decaying function, which is essential for the gravity model. However, the most important limitation 
of the doubly constrained gravity model is the calibrated parameters from the empirical trip data in one area lack 
meaningful interpretation and thus are not transferable to other regions. As a consequence, in current epidemi- 
ology studies the more generalizable but less accurate unconstrained gravity model is often used 1 . 

From another perspective, the recently proposed radiation model 16,17 has an analytical formulation that 
estimates trip volumes without parameters; using as input the distribution of population only. It takes inspiration 
from the intervening opportunity model, which assumes that the trip volume is more related to the number of 
opportunities between the origin and the destination instead of just to their distance 8,18 . As is recently shown by 
Simini et at. 17 , the radiation model represents a general framework that includes the intervening opportunity 
model as a degenerated case and it is linked with the the gravity model through the spatial distribution of 
opportunities. The model was originally developed to predict inter-county flows, which not necessarily account 
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for daily trips. Some recent works 19 " 21 have shown that the para- 
meter-free radiation model 16 does not work well at predicting 
intra-city trips. Two of these works 19 ' 20 introduced models with a 
calibrating parameter as a cost function of the distance to reproduce 
intra city trips. In the gravity model based approach 19 a scaling para- 
meter depending on region size is introduced; while in intervening 
opportunity model based approaches introducing such an interpret - 
able parameter that can be estimated without trip data remains an 
open question. 

The present work seeks to answer two questions: How the value of 
the parameter that imposes the scale dependency can be interpreted 
and estimated without trip data? Under which conditions the pre- 
diction of models not calibrated with empirical trip data would work? 
The extended radiation model that we present in this study is distinct 
from previous formulations, in that it is a stochastic model that 
depends on the distributions of opportunities and population only. 
The one scaling parameter a depends on the region size and the 
heterogeneity of opportunity distribution, which makes it interpret- 
able and estimable in many cases even without trip data. The 
extended radiation model is derived by relating the trip production 
and attraction process to survival analysis. We propose an analogy 
between survival time t and the number of opportunities a a com- 
muter has considered. Another important ingredient for modelling 
trips within small scales (intra urban trips) is the separation between 
population density and trip attraction rates. Most models on intra- 
city trips use the density of population 20,21 as a proxy for both trip 
generation and trip attraction rates. While this approximation is 
reasonable at large scales, at inner city scale trip attraction is better 
represented by the density of point of interests (POIs), defined as 
geolocated non residential establishments presented on a digital map. 

The proposed model combines the closed analytical form of the 
original radiation model and the flexibility of gravity-like models. 
While it can be calibrated when empirical trip data is available, it also 
provides an analytical parameter estimation when there is no trip 
data for calibration. The model is validated in the U.S. by the census 
commuting data and in three other countries (Portugal, Dominican 
Republic, and Rwanda) by cell phone records. 

Results 

Evaluation of the gravity model and the radiation model. The 

simplest form of the gravity model is 1 " 4 : 



nfnf 



(i) 



where Ty is the flow between zone i andj. n t and rij are the population 
of the two zones, is the distance between them and C is a distance 
decaying function. y\ and /3 are parameters to be fitted from data, y is 
an adjustment parameter controlling the sum of the flows. This is 
usually called the unconstrained gravity model because it does not 
guarantee the attainment of the desired generation and attraction 
marginal volumes in each zone. 

In transportation planning, the gravity model usually takes the 
form 11 ' 15 ' 22 : 



T 
1 11 



riiPPiDj 



(2) 



where O z and D ; are the total trip production and attraction volumes 
of zones i and; respectively. For a study region with AT zones, there are 
2N parameters of r\ { and Pj. These parameters are calculated by itera- 

tively applying: r\ { = l/^PjDjCfcj) and ^ = l/^.^O.C^). 
This is called the doubly constrained gravity model because it ensures 
consistent values of the trip production y^ .T^- = O z , and trip attrac- 
tion ^^.Tij = Dj per zone. In order to calibrate the rji and para- 



meters, the model requires accurate input of the total trip production 
and attraction volumes O,- and Dj. 

An alternative model that does not require calibration is the 
recently proposed radiation model, in which Ty takes the form: 



Or 



m+Sij) (ni + nj+Sij 



(3) 



where Sy is the population within the circle of radius centered at 
zone i (not including the population in zones i and;) and the rest of 
the notations are the same as in the gravity model. 

We explore the suitability of the doubly constrained gravity model 
and the radiation model on predicting commuting flows at three 
different scales: the Western U.S., the entire San Francisco Bay area, 
and the city of San Francisco (see Methods). 

We apply the doubly constrained gravity model with power dis- 
tance decay function C(r) = (which is better than the exponential 
decay function in the example regions) and compare it with the 
radiation model. Fig. 1(a) shows the commuting distance distri- 
bution P(r) of different models at the three scales of study. When 
we compare inter-county trips in the Western U.S. both the para- 
meter-free radiation model and the calibrated gravity model with 2N 
+ 1 parameters perform similarly. Adjusting the Y\ b and the para- 
meter k in the distance decaying function can not fit the model well 
for both short distance trips and long distance trips. This confirms 
the results reported in 16 . 

However, when trying to predict the commuting flows among 
zones within the Bay area or San Francisco, without parameters for 
calibration the situation is much harder because the density of popu- 
lation is more homogeneously distributed and commuters tend to go 
to various business districts across the area (see Fig. 2). Such scale of 
daily trips is where the calibration parameters start playing an 
important role and the calibrated gravity model performs better than 
the parameter-free radiation model. In order to inspect further this 
situation, the distribution of the total number of opportunities a 
between trip origin and destination is calculated. Fig. 1(b) shows that 
at the Bay area scale (zone size / ~ 10 km), there is a region a < a avg 
where there is not a clear functional form on the enclosing number of 
opportunities a between the origin and the destination (a avg is the 
average number of opportunities in a zone). While for a > a avg the 
probability of finding a trip start monotonically decaying. This effect 
of clear decaying behavior for a > a avg is not observed in commuting 
trips within San Francisco. Based on these observations we look for a 
way to introduce the effects of scale on the radiation model. 

Extension of the radiation model. We introduce a scaling parameter 
a by combining the derivation of the original radiation model with 
survival analysis. Survival analysis uses statistical methods to deal 
with the analysis of time to events, such as life time distribution of 
living organisms or machine components 27 . The two objects of 
primary interest are the survival function and the hazard function. 
The survival function, S(f), represents the life time distribution of an 
entity: 



S(t)=Pr(T>t) 



(4) 



S(t) is the cumulative probability of having survived after time t. The 
hazard function, h(t), represents the conditional death rate at time t: 



h(t)- 



lim 

dt^O 



Pr(t<T<t + dt\T>t) 
dt 



(5) 



Different forms of hazard function h(t) will lead to different survival 
function S(t). The two most generally used hazard functional forms 
are h(t) = X and h(t) = Xat a ~ l . hit) = X leads to the exponential 
survival function S(t) = e~ x \ while h(t) = Xaf~ l leads to the Weibull 
survival function S(t) = e~ Xt ° \ For a — » 0 the hazard function does 
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Figure 1 | Comparison of the census data, the calibrated doubly constrained gravity model, the parameter-free radiation model, and the calibrated 
extended radiation model, (a) The three columns represent San Francisco, the Bay area, and the Western U.S. respectively. The three rows are the results 
of the three different models compared with the census data. The radiation model gives relatively good prediction only at the Western U.S. scale. At the 
two smaller scales the radiation model under-estimates long distance trips. The doubly constrained gravity model gives close predictions to the census 
data at the three scales. The extended radiation model, with one parameter a, achieves the same prediction quality. The dashed line is a guide to the eye 
with the distance decaying function P(r) = 100(r + 10) ~ 2 7 . (b) The P{a) distribution, is the probability of measuring a commuting trip with a 
opportunities between the origin and the destination. Because the radiation model is not suitable for the two smaller scales, here only the extended 
radiation model and the doubly constrained gravity model are compared with the census data. The flat distribution of P(a) in the Census data within San 
Francisco differs from the other two scales, showing that the distribution of intra-city flows is influenced less by the number of opportunities between the 
origin and destination. 



not depend on time f, this is the effect that we would like to replicate 
for smaller zones: job selection has low dependence on the number of 
opportunities between home and the selected work location. 

The intervening opportunity model can be derived under the sur- 
vival analysis framework if we think of the departure from origin as 
'birth' and the arrival at the destination as 'death' while the lifetime is 
measured as the number of opportunities, a, between the origin and 
the destination. The survival function S(a) represents the cumulative 
probability of not finding a workplace within a opportunities. S(a) = 
e~ la and S(a) = e~ lcf ' are exactly the two most commonly used inter- 
vening opportunity models 818 . Note here that a controls the slope of 
the decay of S(a). 

We show here the radiation model can also be derived under the 
survival analysis framework. If P>(a) represents the probability of 
not accepting the closest a opportunities, it has the same meaning of 
the survival function S(t) in the survival analysis. A person commut- 
ing from a region of n { opportunities to a region of rij opportunities 
with s opportunities in between, can be expressed as the conditional 
probability of accepting one of the rij opportunities between a and a 
+ rip given that the closest w,- opportunities are not chosen. Note that 
a = Hi + s for Eq. 3. This probability can be written as: 

/ , x P>(a)-P > (a + n j ) 
P(Mn i ,n i ,a)= " p> ^ * (6) 

Then the core question becomes how to get the expression of P>{d). 
Different individuals should have different hazard rates (or job 
expectations in this case) 28 ' 29 , we can assign different parameter A z - 
value to different individuals i. So the survival function of the entire 
population from a given origin becomes: 



P > (a)=E[e-' 
If we define p{X) = e~ l and X 

r + co 

p>(*)= / 

J ) 



e- Xa p{X)dl 



r + co 

']=/ < 

J 0 

e (0, +oo) 

-/. 



-la 



(0, 1): 



+ GO -i 
-Ad -/r 1 

e e ua = 



l + a 



(7) 



(8) 



which gives the radiation model expression introduced in Eq.6 (we 
are using the same notation as in 17 ). One reason of choosing p(X) to 
be an exponential distribution is it connects the survival analysis with 
the derivation of the original radiation model 17 . We further elaborate 
the influence of the form of thep(2) distribution in the SI Appendix. 
If we extend the analysis to the Weibull survival function, we get: 

P + GO 



P>(a) 



■p{X)dk-- 



1 



l + a a 



Eq. 6 becomes: 



P{\\ni,nj,aij) = 



Tij = ymi 



(«5 + l)[(^ + n/)" + l] 

between two 5 
d: 

P(l|n / ,n ; -,a z) ) 
J2 k PWnin k ,a ik ) ' 



(9) 



(10) 



In order to calculate the flows between two zones i and j, two nor- 
malization constants are needed: 



(ii) 



since not all people are commuting and also the commuting within 
each zone is not counted, y is the percentage of population that is 
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Trip Production Trip Attraction Pop Density POI Density (#/km 2 ) 




Figure 2 | Trip production and attraction at different scales, (a) A map showing the three selected regions of study representing: the Western U.S., the Bay 
area, and San Francisco. The color bar represents population density, (b-e) Commuting trip generation rate, trip attraction rate, the population density 
and the POI density in (#/km 2 ) in San Francisco. While the population density has high correlation with the commuting trip generation rate, the 
POI density has high correlation with the commuting trip attraction rate. In contrast, at the scale of the Western U.S., the population density correlates with 
both the trip generation rate and the attraction rate, as shown in Table 2. The maps are generated from ARCGIS using TIGER/Line Shapefiles. 



commuting between different zones in the study area. m* is the 
population at the origin. If the empirical trip data or cell phone 
records (as will be shown in the sections below) are available, y can 
be calculated from the total number of observed trips. If neither data 
source is available, the flow distribution can still be calculated 
because y doesn't influence the relative ratio of flows between differ- 
ent OD pairs. The denominator in the equation is a normalization 
constant for finite sized area. The influence of the border effect and 
how this formulation can solve some of the limitations in previous 
models is detailed in the SI Appendix. 

The extended radiation model is calibrated to the three regions 
examined in the previous section. The obtained a values are 0.003, 
0.05 and 1.5 respectively. We use the common part of commuters 
(CPC), based on the Sorensen index 30 , to quantitatively measure the 
goodness of flow estimation. 



CPC 



2NCC 



nc(t)+nc(t) 



(12) 



ncc(t,t)=^ E^^K^'^)^ 0 ^)^! Y.U T * 

This index shows which part of the commuting flow is correctly 
estimated, 0 means no agreement found and 1 means perfect estima- 
tion. Table 1 shows that the extended radiation model gives estima- 
tions with similar performance to the doubly constrained gravity 
model at the three regions while the original radiation model's 
estimation power decays with the region granularity. The goodness 
of fit of the extended radiation model is close to other recently pro- 
posed models 19 . The difference is that the study in 19 uses actual 
commuting flow generation and attraction volumes as input, while 
in the current model we use more easily acquired population and POI 
density as proxies, but achieved the same level of accuracy. 

Absence of data to calibrate a. The P>(a) distribution plays an 
important role in the formulation and a parameter calibration, 



thus worths further scrutiny. When the space is infinite and the 

opportunities are continuous P > (a)= — - is a monotonically 

decreasing function with slope given by the a value. But if we are 
considering trips only within a finite sized region, this implies a finite 
numbers of opportunities possible, up to a tot . Thus P>(a tot ) = 0. 
Also, we divide a study region into a finite number of zones « ceHs , 
so a can only take a set of discrete values. We define a avg = a tot /n ce u s . 
a min is the smallest number of opportunities in all the zones. Since 
within zone trips are not considered, P>{a) should start to decrease 
after a min . This value is not known a priory but may be approximated 
by a avg in the absence of data on trips. We correct the expression in 
Eq. 9 to account for these effects as: 

l l_ 

, /\\ l + fl a 1 + af t / \ 

{P > {a)) = — i — — — j— ,a t ot >a>a a vg (13) 



Now, we explore how Eq. 13 can reproduce the P>(a) measured from 
data. The top panel of Fig. 3(a) shows the results of P>(a) calculated 
from the census commuting data in San Francisco, the Bay area and 
the Western U.S. The solid lines show Eq. 13 with different a values, 
note that the two limiting values of a avg < a < a tot determine the 
range of the equation. By comparing Eq. 13 with the data as seen in 
Fig. 1(b) and Fig. 3(a), we see that in the intra-city scale the P>(a) 
distribution does not decay beyond a avg) so we can't use Eq. 13 to 
estimate the radiation model parameter. For the Bay area and the 
Western U.S. scale, a min ~ a avg , Eq. 13 works well and the value of a 



Table 1 | CPC 


values for different models at different regions 




Western U.S. 


Bay area 


San Francisco 


Ext. Radiation 


0.51 


0.67 


0.65 


Gravity 


0.5 


0.64 


0.66 


Radiation 


0.43 


0.4 


0.23 
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Figure 3 | Interpretation of the parameter a. (a) Effects of the scale on the P>(a) distribution, showing as symbols the results from census data of San 
Francisco, the Bay area and the Western U.S. The black solid lines are evaluations of Eq. 13 for different values of a, and the dashed line is the expected a 
value using Eq. 14. The analytical predictions work for the Bay area and Western U.S. From top to bottom we see the P>(a) distribution for three regions 
with similar sizes. Given a fixed scale, the a value is influenced by the heterogeneity of the distribution of opportunities. The more heterogeneous the 
region is, the larger the difference between a avg and a mim as is shown in Las Vegas-L.A. region. In these cases the prediction of a (Eq. 13) will not resemble 
their calibrated values and thus calibration is needed, (b) For each zone scale /, 200 regions with random centers are selected within west U.S. In each case 

the corresponding a value is calibrated with census trip data. The functional relationship between a and / is a = 

80 percentile a value for each scale. The inset shows the same plot in logarithmic scale. Marked as solid blue circles is the scale range that a values can be 
predicted without data calibration. The calibrated results with trip data for U.S. regions are marked as red squares, while the examples from other 
countries are marked as green triangles. They all follow the functional approximation. 



. The error bar shows the 20 and 



(0.1 < a < 2) should increase with the scale /. For a fixed scale /, if the 
heterogeneity of opportunity distribution increases, then a min further 
differs from a avg . In those cases a > 2 and it is not possible to estimate 
a without data calibration (as shown for Las-Vegas-Seattle in 
Fig. 3(a)). The detailed explanation is in the rest of this section. 

We further explore how the parameter a systematically changes as 
the size / of the commuting zones changes. We evaluate the com- 
muting within regions divided into n ce u s = 100 zones of size / ranging 
from a few kilometers to over 100 kilometers (See Fig. 3 and Fig. 4). 
We randomly chose 200 study regions for each scale with total popu- 
lation of at least 5, 000 X / in order to avoid unpopulated regions such 
as national parks. The census commuting OD data is used to calibrate 
the a value in each region using Eq. 1 1. Fig. 3(b) shows how a is close 
to zero for / < 10 km and starts to increase as a power function 
beyond that value. The functional relationship as a solid line is: 



( 1 

i — 



\36[fc« 



(14) 



The error bars show the 20th and 80th percentile of the a value at 
each scale. The three cases calibrated before: San Francisco, the Bay 
area, and the Western U.S. are marked in red squares. They are all 
close to the expected values calculated from Eq. 14. For trips within a 



city and up to metropolitan urban areas the a value is close to 0 and 
the error bar is narrow. In this limit (a —> 0) we have: 



P(l|n z? n j? ay) = lim 



[(a^n^-a^ « + l) 

l ^(^+i)[K+n ; -r+i] (15) 

aij + njY-afj 



lim 

a^0 



The ratio between [a^ + n^f — a| and is: 



lim 



- = lim a — 



a^0 afj 



(16) 



The a will cancel out when substituting the expression back to Eq. 11. 
The detailed derivation is in the SI Appendix. This implies that the 

predicted flow is proportional to — , while in the parameter-free 

ay 

Hi 

radiation model, the flow is proportional to . We have mentioned 
that within such short distances Eq. 13 is not suitable for the exact 
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Figure 4 | The influence of the opportunity homogeneity on the parameter a value. The two grey rectangles are of the same scale: 600 km. The 
population distribution of the southern region has only two sharp peaks: Los Angeles and Las Vegas, while the population in the northern region is more 
homogeneously distributed. The right section of the figure shows one example OD pair in each region with similar m iy n iy rij and Sy values. But the distance 
between Los Angeles and Lake Havasu City is much longer than the distance between Seattle and Wenetchee, which makes its commuting flow volume 
much smaller. This effect of distance is taken into consideration by the difference in the a value. For the southern region a = 5 while for the northern one 
a = 1.6. The grey regions are combined statistical areas which usually include one or more populated metropolitan areas. 



estimation of a. But given that at such scale 0 < a < 0.1, and the 
model is not sensitive to the parameter value when a —> 0, in zones 
that / < 10 km Eq. 13 is able to generate a reasonable approximated a 
value for trip volume estimation. 

In the range / ~ 10. . .65 km, 0.1 < a < 2 most commonly accounts 
for inter city trips, the model without data calibration is expected to 
predict trips well because of the narrow error bar. For larger scale 
regions enclosing trips between two or more combined statistical 
areas such as the ones shown in Fig. 4, a > 2 and has a wide range. 
We notice that the main driving factor influencing the a value for the 
same scale is the differences in the homogeneity of facility density; 
which are highly correlated to population density at these large 
scales. 

In Fig. 4, the two marked regions have the same scale: / = 60 km. 
The population distribution of the southern region has two sharp 
centers, Los Angeles and Las Vegas, while the rest has low population 
density. In the northern region, the population is more homoge- 
neously distributed. One example OD pair is shown for each region 
on the right part of Fig. 4: From Los Angeles to Lake Havasu City for 
the southern region and from Seattle to Wenatchee for the northern 
region. They have similar m b n b rij and sy values. According to the 
original radiation model, they should have similar flow volumes. But 
in the census there are only 26 people commuting from Los Angeles 
to Lake Havasu City while there are 167 commuting from Seattle to 
Wenatchee. The reason is quite clear on the map: the distance from 
Seattle to Wenatchee is only 150 km while the distance from Los 
Angeles to Lake Havasu City is much longer because of the low 
population/opportunity density between the origin and the destina- 
tion. To put it in another way, people have to travel further to be able 
to explore the same amount of opportunities. This causes the cali- 
brated a value of the southern region to be 5, much larger than the 
northern one, which is 1.6. As is shown in Fig. 3(a), the more het- 



erogeneous the distribution of population is; the larger the difference 
in a~iy. and and the larger the a value becomes. In those cases 
the P>(a) from empirical trip data differs more from Eq. 13 and 
empirical trip data is needed for parameter calibration. More quant- 
itative ways to measure the influence of the degree of heterogeneity as 
a cost function depending on the distance between the origin and the 
destination remains an open question for further studies. 

People's location choices are not influenced by the choice of study 
region sizes. What the parameter a captures for the scale dependency 
is the granularity of aggregation. Ideally the location choice should be 
modeled to the smallest spatial granularity, then aggregated to the 
desired granularity level. But in practice such fine grained input data 
are usually not available, in such cases the a parameter helps the 
model estimation at the desired granularity directly, without requir- 
ing finer grained data. 

In summary, even without empirical trip OD data, if the commut- 
ing zones are in the range / ~ 1...65 km, it can be expected for the 
extended radiation formula to give good commuting flow prediction. 
In these cases Eq. 14 gives us: 0 < a < 2. For zones with large sizes, if 
the opportunity distribution do not have strong heterogeneity (a min 
~ a av g )> me monotonic increase of a with scale / is usually captured 
by Eq. 13. In other situations the model needs to be calibrated with 
empirical OD data. 

Multi-regional study and the role of cell phone data. Not many 
countries in the world have detailed census data for commuting flow 
prediction and model calibration. Those countries with data scarcity 
are often developing countries that need this kind of modeling the 
most. For these countries finding an alternative data source to 
provide guidance for their urban growth, economic planning and 
epidemics controlling is a pressing need. In this section we show 
how the extended radiation and the gravity model can be 
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Figure 5 | Validation of the IPF expression method and commuting patterns in different regions, (a-c) The comparison of the distribution of 
commuting distance, the distribution of the number of commuters between O-D pairs, and the comparison of the census commuting flow with the 
expanded cell phone users' commuting flow T{j. The close fitting in all three figures shows that we can recover the commuting patterns of the whole 
population from the seed matrix provided by the cell phone records, (d-h) Comparison of the distributions of commuting distance for Lisbon, Santo 
Domingo, Kigali, Portugal, and Rwanda. The extended radiation model can be successfully applied to all these cases and the corresponding a versus / 
relationship is marked in Fig. 3. 



calibrated given estimated commuting trips measured from cell 
phone data. More importantly we can compare the phone data 
calibrated parameter a with the one predicted from Eq. 14 to 
explore the generality of that expression. 

Cell phone records are increasingly showing the potential to 
become a data source of valuable information 3133 since most popu- 
lated areas have cell phone service coverage and the value of cell 
phone data in modeling human mobility has recently been high- 
lighted in various studies 1 ' 34 " 36 . For instance, in Rwanda there is no 
detailed commuting census data available. Even if there were, the 
high migration rate of people would make the census outdated 
quickly. However, the country has 215, 030, 420 cell phone records 
from one cell phone service provider in just three months. In this 
section cell phone records are used to extract a seed commuting OD 
matrix, which is expanded using iterative proportional fitting to 
estimate the full commuting OD matrix for the whole population 
under study. 

We use the Bay area as an example to validate the method. 
Fig. 5 (a-c) shows the comparison of the results of the distribution 
of commuting distance, the distribution of the number of commuters 
between O-D pairs, and the comparison of the census commuting 
flow Ty with the expanded cell phone user commuting flow Tjj. The 
close fitting in all the three figures shows that we can recover the 
commuting patterns of the whole population from the seed matrix 
provided by cell phone records. For countries that do not have popu- 
lation density census statistics for the IPF expansion, we can use the 
Landscan 37 population density estimation which is available world- 
wide at 1 km 2 resolution. 



We then extended our study to three different countries: Portugal, 
Dominican Republic and Rwanda. We selected the capitals in the 
three countries: Lisbon, Santo Domingo (including the greater met- 
ropolitan area), and Kigali; and also did the analysis for the entire 
Rwanda and Portugal (we do not have the cell phone information 
available for the entire Dominican Republic). We calibrated the grav- 
ity model and the extended radiation model to test how much can 
they recover these regions' commuting patterns. The results are 
shown in Fig. 5 (d-h). The difference in the commuting distance 
distribution in these regions are captured by both models. The values 
of / and a for the extended radiation model of these regions are 
marked as triangles in Fig. 3(b). All of them conform to the func- 
tional form of a observed from the U.S. regions. This shows that the 
relationship between a and the scale / is generalizable, in this case we 
could have used the extended radiation formula in these countries to 
estimate trips, in the absence of trip data to calibrate the model. 

Discussion 

In summary, we propose an extension to the radiation model that can 
be calibrated with one scale parameter to predict commuting flows at 
different spatial scales. The scale parameter a modulates the influ- 
ence of the opportunity distribution heterogeneity and the spatial 
scale / of the commuting zones. 

Our results are then compared with the benchmark model cali- 
brated with trip data, known as the doubly constrained gravity 
model. A multi-scale study shows that both the extended radiation 
model and the doubly constrained gravity model give close estima- 
tions to the census commuting flows. 
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Table 2 | Correlation between commuting trip generation, attrac- 
tion, population and POI density 

Western U.S. Bay area San Francisco 

Population POI Population POI Population POI 

Trip Generation 0.993 0.926 0.971 0.491 0.956 0.292 
Trip Attraction 0.989 0.930 0.417 0.859 0.157 0.880 



The main advantage of the proposed modelling framework is that 
it still can be applied to predict number of commuting trips when 
lacking data for calibration. The a parameter depends on the scale of 
the study region and the homogeneity of the population distribution. 
We show that for Eqs. 10 and 11 to give good results, the size of the 
n ce u s commuting zones is in the range / ~ 1...65 km, representing 
0 < a < 2. These values of a imply mild heterogeneity in the distri- 
bution of opportunities among zones; which here means that the 
minimum number of opportunities a min enclosing trips is close to 
a avg (the average number of opportunities for all the zones, a avg = 
a tot ln ce u s ). Other quantitative ways to measure the degree of hetero- 
geneity remains an open question for further studies. For larger 
regions (/ > 65 km) the a value range is wide because the heterogen- 
eity of population distribution is highly variable at this scale. In these 
cases the model is better used with empirical data for parameter 
calibration rather than estimating a from Eq. 14 directly. 

The presented study provides the first building blocks for a multi- 
scale generator of human mobility expressed as a functional form of 
the distributions of population and job facilities. We tested it in 
different scales at different countries and discussed its range of 
applicability. We share the sample of the U.S. county level commut- 
ing flow prediction on our web-page to help in this direction 38 . 

Methods 

Input data preparation. The three regions, San Francisco, San Francisco Bay Area, 
and the Western U.S., are shown in Fig. 2(a). The Western U.S. is divided into 183 
counties while the two smaller regions are divided into n ce u s =100 zones to calculate 
the ODs. Each zone is a cluster of blocks determined by applying k-means clustering 
method on the 7, 348 census blocks in San Francisco and 117, 219 blocks in the Bay 
area. Note that the unconstrained gravity model is not compared here because when 
there is empirical OD for parameter calibration, the doubly constrained gravity model 
performs much better. Detailed comparisons between the two gravity models are in 
the SI Appendix. 

We first test the usual assumption in both models: the population density could 
represent both the commuting trip generation and attraction rates at different scales. 
We use the 2010 census LEHD Origin- Destination Employment Statistics 
(LODES) 23 , which provides home and employment locations for the entire U.S. 
population at block level. The first column in Table 2 shows the correlations between 
densities (#/km 2 ) of commuting flow generation, attraction and population in the 
Western U.S. Both of them have high correlations, so at this scale the assumption 
holds. Fig. 2(b, c) shows the commuting trip generation and attraction rates in San 
Francisco. Their distributions are less similar. Thus, we need to find a better proxy for 
commuting trip attraction rates at smaller scales. 

Digital traces of facilities are available on-line, they provide good estimates of 
commuting trip attractions 24 " 26 . In this study we use the density of point of interests 
(in #/km 2 ) of each zone to represent the commuting trip attraction mte(#/km 2 ). The 
three study regions contain 1, 774, 154; 319, 170 and 85, 230 POIs extracted from 
Google Places respectively. According to Table 2, at all the scales POI density has high 
correlation with the commuting trip attraction rate. A more detailed multi- scale 
correlation analysis is in the SI Appendix. 

Seed cell phone OD matrix expansion. Each cell phone record has a time stamp and 
a corresponding cell phone tower. For each user, the most frequently used tower 
between 6PM and 6AM is assigned as the home location and the most frequently used 
tower during day is assigned as the work location. Using the 2010 census home and 
employment location data as a benchmark, we chose the Bay area as an example to 
validate that the cell phone data could provide accurate predictions to commuting 
flow patterns. The sample includes 189, 621 cell phone users. We mapped the 892 cell 
phone towers in the Bay area to the previously defined 100 block clusters to get the 
commuting OD matrix for the cell phone users. The iterative proportional fitting 
(IPF) method is performed to expand the cell phone user OD matrix to the OD matrix 
for the entire population 39 . The basic procedure is first getting the distribution of 
population and POIs to represent the marginal distributions of commuting trip 



generation and attraction rates for each block cluster. Then iteratively adjusting the 
elements of the seed matrix to let them match the desired margins. 
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